Computer architecture
The template for all modern computers is the von Neumann architecture, detailed in a 1945 paper by Hungarian mathematician John von Neumann. This describes a design architecture for an electronic digital computer with subdivisions of a processing unit consisting of an arithmetic logic unit and processor registers, a control unit containing an instruction register and program counter, a memory to store both data and instructions, external mass storage, and input and output mechanisms. The meaning of the term has evolved to mean a stored-program computer in which an instruction fetch and a data operation cannot occur at the same time because they share a common bus. This is referred to as the von Neumann bottleneck and often limits the performance of the system.

Processors

CPU

A central processing unit (CPU), also called a central processor or main processor, is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions. The computer industry has used the term "central processing unit" at least since the early 1960s. Traditionally, the term "CPU" refers to a processor, more specifically to its processing unit and control unit (CU), distinguishing these core elements of a computer from external components such as main memory and I/O circuitry.

The form, design, and implementation of CPUs have changed over the course of their history, but their fundamental operation remains almost unchanged. Principal components of a CPU include the arithmetic logic unit (ALU) that performs arithmetic and logic operations, processor registers that supply operands to the ALU and store the results of ALU operations, and a control unit that orchestrates the fetching (from memory) and execution of instructions by directing the coordinated operations of the ALU, registers and other components.

ALU

An arithmetic logic unit (ALU) is a combinational digital electronic circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating-point numbers. An ALU is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs). A single CPU, FPU or GPU may contain multiple ALUs.

The inputs to an ALU are the data to be operated on, called operands, and a code indicating the operation to be performed; the ALU's output is the result of the performed operation.
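As a rough illustration of the interface just described (two operands and an operation code in, one result out), the following is a minimal Python sketch. The opcode names, the 8-bit word width and the `alu` function itself are assumptions made for this example; they do not correspond to any particular processor.

```python
from enum import Enum


class Op(Enum):
    """Hypothetical operation codes for the sketch."""
    ADD = 0
    SUB = 1
    AND = 2
    OR = 3
    XOR = 4


WIDTH = 8                    # assumed word width in bits
MASK = (1 << WIDTH) - 1      # 0xFF for an 8-bit ALU


def alu(op: Op, a: int, b: int) -> int:
    """Apply the operation selected by `op` to operands `a` and `b`.

    The result is truncated to WIDTH bits, as a fixed-width
    hardware ALU would do.
    """
    a &= MASK
    b &= MASK
    if op is Op.ADD:
        result = a + b
    elif op is Op.SUB:
        result = a - b
    elif op is Op.AND:
        result = a & b
    elif op is Op.OR:
        result = a | b
    elif op is Op.XOR:
        result = a ^ b
    else:
        raise ValueError(f"unknown operation: {op}")
    return result & MASK
```

For instance, `alu(Op.ADD, 200, 100)` returns 44 rather than 300, because the sum does not fit in eight bits and wraps around.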
In many designs, the ALU also has status inputs or outputs, or both, which convey information about a previous operation or the current operation, respectively, between the ALU and external status registers.

FPU

A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers. Typical operations are addition, subtraction, multiplication, division, square root, and bitshifting. Some systems (particularly older, microcode-based architectures) can also perform various transcendental functions such as exponential or trigonometric calculations, though in most modern processors these are done with software library routines.

GPU

A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics and image processing. Their highly parallel structure makes them more efficient than general-purpose central processing units (CPUs) for algorithms that process large blocks of data in parallel. In a personal computer, a GPU can be present on a video card or embedded on the motherboard. In certain CPUs, they are embedded on the CPU die.

The term "GPU" was coined by Sony in reference to the PlayStation console's Toshiba-designed Sony GPU in 1994. The term was popularized by Nvidia in 1999, who marketed the GeForce 256 as "the world's first GPU". It was presented as a "single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines". Rival ATI Technologies coined the term "visual processing unit" or VPU with the release of the Radeon 9700 in 2002.

Memory

DRAM

Dynamic random-access memory (DRAM) is a type of random-access semiconductor memory that stores each bit of data in a separate tiny capacitor within an integrated circuit. The capacitor can either be charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1. The electric charge on the capacitors slowly leaks off, so without intervention the data on the chip would soon be lost. To prevent this, DRAM requires an external memory refresh circuit which periodically rewrites the data in the capacitors, restoring them to their original charge.
This refresh process is the defining characteristic of dynamic random-access memory, in contrast to static random-access memory (SRAM), which does not require data to be refreshed. Unlike flash memory, DRAM is volatile memory (as opposed to non-volatile memory), since it loses its data quickly when power is removed. However, DRAM does exhibit limited data remanence.

DRAM is widely used in digital electronics where low-cost and high-capacity memory is required. One of the largest applications for DRAM is the main memory (colloquially called the "RAM") in modern computers and graphics cards (where the "main memory" is called the graphics memory). It is also used in many portable devices and video game consoles. In contrast, SRAM, which is faster and more expensive than DRAM, is typically used where speed is of greater concern than cost and size, such as the cache memories in processors.

Because it needs a system to perform refreshing, DRAM has more complicated circuitry and timing requirements than SRAM, but it is much more widely used. The advantage of DRAM is the structural simplicity of its memory cells: only one transistor and a capacitor are required per bit, compared to four or six transistors in SRAM. This allows DRAM to reach very high densities, making DRAM much cheaper per bit. The transistors and capacitors used are extremely small; billions can fit on a single memory chip. Due to the dynamic nature of its memory cells, DRAM consumes relatively large amounts of power.

Synchronous dynamic random-access memory

Synchronous dynamic random-access memory (SDRAM) is any dynamic random-access memory (DRAM) where the operation of its external pin interface is coordinated by an externally supplied clock signal.

DRAM integrated circuits (ICs) produced from the early 1970s to early 1990s used an asynchronous interface, in which input control signals have a direct effect on internal functions, delayed only by the trip across the semiconductor pathways. SDRAM has a synchronous interface, whereby changes on control inputs are recognised after a rising edge of its clock input. In SDRAM families standardized by JEDEC, the clock signal controls the stepping of an internal finite state machine that responds to incoming commands.
These commands can be pipelined to improve performance, with previously started operations completing while new commands are received. The memory is divided into several equally sized but independent sections called banks, allowing the device to operate on a memory access command in each bank simultaneously and speed up access in an interleaved fashion. This allows SDRAMs to achieve greater concurrency and higher data transfer rates than asynchronous DRAMs could.

Pipelining means that the chip can accept a new command before it has finished processing the previous one. For a pipelined write, the write command can be immediately followed by another command without waiting for the data to be written into the memory array. For a pipelined read, the requested data appears a fixed number of clock cycles (latency) after the read command, during which additional commands can be sent.

High Bandwidth Memory

High Bandwidth Memory (HBM) is a high-performance RAM interface for 3D-stacked SDRAM from Samsung, AMD and SK Hynix. HBM achieves higher bandwidth while using less power in a substantially smaller form factor than DDR4 or GDDR5. This is achieved by stacking up to eight DRAM dies (thus being a three-dimensional integrated circuit), including an optional base die with a memory controller, which are interconnected by through-silicon vias (TSVs) and microbumps. The HBM technology is similar in principle to, but incompatible with, the Hybrid Memory Cube interface developed by Micron Technology.

The HBM memory bus is very wide in comparison to other DRAM memories such as DDR4 or GDDR5. An HBM stack of four DRAM dies (4-Hi) has two 128-bit channels per die, for a total of 8 channels and a width of 1024 bits. A graphics card or GPU with four 4-Hi HBM stacks would therefore have a memory bus with a width of 4096 bits. In comparison, the bus width of GDDR memories is 32 bits per channel, with 16 channels for a graphics card with a 512-bit memory interface.

SRAM

Static random-access memory (static RAM or SRAM) is a type of random-access memory (RAM) that uses bistable latching circuitry (flip-flops) to store each bit.
SRAM exhibits data remanence, but it is still volatile in the conventional sense that data is eventually lost when the memory is not powered. The term static differentiates SRAM from DRAM (dynamic random-access memory), which must be periodically refreshed. SRAM is faster and more expensive than DRAM; it is typically used for CPU cache while DRAM is used for a computer's main memory.

CPU cache

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have different independent caches, including instruction and data caches, where the data cache is usually organized as a hierarchy of more cache levels (L1, L2, L3, L4, etc.).

All modern (fast) CPUs, with few specialized exceptions, have multiple levels of CPU caches. The exceptions are a few specialized CPUs, accelerators or microcontrollers that have no cache. Where speed is needed, they instead have an on-chip scratchpad memory, which serves a similar function but is managed by software. In microcontrollers, for example, a scratchpad (or at least the absence of a cache) can be preferable for hard real-time use, because with a single level of memory the latency of loads is predictable.

The first CPUs that used a cache had only one level of cache; unlike later level 1 caches, it was not split into L1d (for data) and L1i (for instructions). Almost all current CPUs with caches have a split L1 cache. They also have L2 caches and, for larger processors, L3 caches as well. The L2 cache is usually not split and acts as a common repository for the already split L1 cache. Every core of a multi-core processor has a dedicated L2 cache, which is usually not shared between the cores. The L3 cache, and higher-level caches, are shared between the cores and are not split. An L4 cache is currently uncommon, and is generally implemented in dynamic random-access memory (DRAM) rather than static random-access memory (SRAM), on a separate die or chip.
That was also the case historically with L1, while bigger chips have allowed integration of it and generally all cache levels, with the possible exception of the last level. Each extra level of cache tends to be bigger and to be optimized differently.

Other types of caches exist (that are not counted towards the "cache size" of the most important caches mentioned above), such as the translation lookaside buffer (TLB) that is part of the memory management unit (MMU) that most CPUs have.

Caches are generally sized in powers of two: 4, 8, 16 etc. KiB, or MiB for larger non-L1 caches, although the IBM z13 has a 96 KiB L1 instruction cache.

Registers

In computer architecture, a processor register is a quickly accessible location available to a computer's central processing unit (CPU). Registers usually consist of a small amount of fast storage, although some registers have specific hardware functions, and may be read-only or write-only. Registers are typically addressed by mechanisms other than main memory, but may in some cases be assigned a memory address, e.g. on the DEC PDP-10 and ICT 1900.

Almost all computers, whether load/store architecture or not, load data from a larger memory into registers where it is used for arithmetic operations and is manipulated or tested by machine instructions. Manipulated data is then often stored back to main memory, either by the same instruction or by a subsequent one. Modern processors use either static or dynamic RAM as main memory, with the latter usually accessed via one or more cache levels.

Processor registers are normally at the top of the memory hierarchy, and provide the fastest way to access data. The term normally refers only to the group of registers that are directly encoded as part of an instruction, as defined by the instruction set. However, modern high-performance CPUs often have duplicates of these "architectural registers" in order to improve performance via register renaming, allowing parallel and speculative execution. Modern x86 design acquired these techniques around 1995 with the releases of Pentium Pro, Cyrix 6x86, Nx586, and AMD K5.

A common property of computer programs is locality of reference, which refers to accessing the same values repeatedly and holding frequently used values in registers to improve performance; this makes fast registers and caches meaningful.
Allocating frequently used variables to registers can be critical to a program's performance; this is performed either by a compiler in the code generation phase, or manually by an assembly language programmer.

Program counter

The program counter (PC), commonly called the instruction pointer (IP) in Intel x86 and Itanium microprocessors, and sometimes called the instruction address register (IAR), the instruction counter, or just part of the instruction sequencer, is a processor register that indicates where a computer is in its program sequence.

In most processors, the PC is incremented after fetching an instruction, and holds the memory address of ("points to") the next instruction that would be executed. (In a processor where the incrementation precedes the fetch, the PC points to the current instruction being executed.)

Processors usually fetch instructions sequentially from memory, but control transfer instructions change the sequence by placing a new value in the PC. These include branches (sometimes called jumps), subroutine calls, and returns. A transfer that is conditional on the truth of some assertion lets the computer follow a different sequence under different conditions.

A branch provides that the next instruction is fetched from elsewhere in memory. A subroutine call not only branches but saves the preceding contents of the PC somewhere. A return retrieves the saved contents of the PC and places them back in the PC, resuming sequential execution with the instruction following the subroutine call.

Stack pointer

A stack register is a computer central processor register whose purpose is to keep track of a call stack.

In computer science, a call stack is a stack data structure that stores information about the active subroutines of a computer program. This kind of stack is also known as an execution stack, program stack, control stack, run-time stack, or machine stack, and is often shortened to just "the stack". Although maintenance of the call stack is important for the proper functioning of most software, the details are normally hidden and automatic in high-level programming languages. Many computer instruction sets provide special instructions for manipulating stacks.
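The interplay between the program counter and the stack on a call and a return can be sketched with a toy interpreter. Everything in it is an assumption made for illustration: the instruction names (`nop`, `jump`, `call`, `ret`, `halt`), the tuple encoding, and the use of list indices as addresses do not model any real instruction set.

```python
def run(program):
    """Execute a toy program and return the addresses visited, in order.

    `program` is a list of (opcode, argument) tuples; a list index
    plays the role of a memory address.
    """
    pc = 0        # program counter: address of the next instruction to fetch
    stack = []    # call stack holding saved return addresses
    trace = []    # addresses of the instructions actually executed

    while True:
        op, arg = program[pc]   # fetch
        trace.append(pc)
        pc += 1                 # PC now points to the following instruction

        if op == "nop":
            pass                # sequential execution, nothing else to do
        elif op == "jump":
            pc = arg            # branch: fetch the next instruction elsewhere
        elif op == "call":
            stack.append(pc)    # save the return address, then branch
            pc = arg
        elif op == "ret":
            pc = stack.pop()    # restore the saved PC
        elif op == "halt":
            return trace
        else:
            raise ValueError(f"unknown opcode: {op}")


# A main routine at addresses 0-2 calling a subroutine at addresses 3-4.
example = [
    ("call", 3),     # 0: call the subroutine
    ("nop", None),   # 1: execution resumes here after the return
    ("halt", None),  # 2
    ("nop", None),   # 3: subroutine body
    ("ret", None),   # 4: return to the caller
]
```

Running `run(example)` visits addresses 0, 3, 4, 1 and 2 in that order: the call saves address 1 on the stack, and the return puts it back into the PC.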
A call stack is used for several related purposes, but the main reason for having one is to keep track of the point to which each active subroutine should return control when it finishes executing. An active subroutine is one that has been called but is yet to complete execution, after which control should be handed back to the point of call. Such activations of subroutines may be nested to any level (recursive as a special case), hence the stack structure. If, for example, a subroutine DrawSquare calls a subroutine DrawLine from four different places, DrawLine must know where to return when its execution completes. To accomplish this, the address following the instruction that jumps to DrawLine, the return address, is pushed onto the call stack with each call.

Register renaming

In a register machine, programs are composed of instructions which operate on values. The instructions must name these values in order to distinguish them from one another. A typical instruction might say: "add x and y and put the result in z". In this instruction, x, y and z are the names of storage locations.

In order to have a compact instruction encoding, most processor instruction sets have a small set of special locations which can be referred to by special names: registers. For example, the x86 instruction set architecture has 8 integer registers, x86-64 has 16, many RISCs have 32, and IA-64 has 128. In smaller processors, the names of these locations correspond directly to elements of a register file.

Register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. While a programmer in assembly language refers for instance to a logical register accu, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and cannot be referenced directly but only via the canonical names.

This technique is used to eliminate false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
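A minimal sketch of the idea follows, under simplifying assumptions that are not taken from any real processor: instructions are `(dest, src1, src2)` tuples, physical registers are named `p0`, `p1`, and so on, and the pool of physical registers is unlimited (real hardware recycles them from a free list). Each write to a logical register is given a fresh physical register, while each read uses whichever physical register currently holds that logical register's value.

```python
def rename(instructions, logical_regs):
    """Rewrite (dest, src1, src2) tuples to use physical register names."""
    # Initially every logical register is backed by its own physical register.
    mapping = {reg: f"p{i}" for i, reg in enumerate(logical_regs)}
    next_free = len(logical_regs)

    renamed = []
    for dest, src1, src2 in instructions:
        # Sources are read through the current mapping, so true
        # (read-after-write) dependencies are preserved.
        phys_src1, phys_src2 = mapping[src1], mapping[src2]

        # The destination gets a fresh physical register, so a later
        # reuse of the same logical name no longer conflicts with
        # earlier readers or writers of that name.
        mapping[dest] = f"p{next_free}"
        next_free += 1

        renamed.append((mapping[dest], phys_src1, phys_src2))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r1", "r2"),  # r4 = r1 op r2  (truly depends on the first write)
    ("r1", "r3", "r3"),  # r1 = r3 op r3  (reuses the name r1 only)
]
```

With logical registers `r1` to `r4`, `rename(program, ["r1", "r2", "r3", "r4"])` yields `[("p4", "p1", "p2"), ("p5", "p4", "p1"), ("p6", "p2", "p2")]`. The third instruction now writes `p6` instead of the register the second instruction reads, so it no longer has to wait for the second instruction.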
Machine code

Machine code is a computer program written in machine language that can be executed directly by a computer's central processing unit (CPU). Each instruction causes the CPU to perform a very specific task, such as a load, a store, a jump, or an ALU operation on one or more units of data in registers or memory.

Machine code is a strictly numerical language which is intended to run as fast as possible, and may be regarded as a primitive and hardware-dependent programming language. While it is possible to write programs directly in machine code, it is tedious and error-prone to manage individual bits and calculate numerical addresses and constants manually. For this reason, programs are very rarely written directly in machine code in modern contexts, although this may still be done for low-level debugging, program patching (especially when assembler source is not available) and assembly language disassembly.

A much more readable rendition of machine language, called assembly language, uses mnemonic codes to refer to machine code instructions, rather than using the instructions' numeric values directly. For example, on the Zilog Z80 processor, the machine code 00000101, which causes the CPU to decrement the B processor register, would be represented in assembly language as DEC B.

Relationship to microcode

In some computer architectures, the machine code is implemented by an even more fundamental underlying layer called microcode. Microcode typically resides in special high-speed memory and translates machine instructions, data or other input into sequences of detailed circuit-level operations. It separates the machine instructions from the underlying electronics so that instructions can be designed and altered more freely. It also facilitates the building of complex multi-step instructions, while reducing the complexity of computer circuits. Writing microcode is often called microprogramming, and the microcode in a particular processor implementation is sometimes called a microprogram.
More extensive microcoding allows small and simple microarchitectures to emulate more powerful architectures with wider word length, more execution units and so on, which is a relatively simple way to achieve software compatibility between different products in a processor family.

Pipeline

One of the most significant characteristics of RISC (reduced instruction set computer) processors was that external memory was only accessible by a load or store instruction. All other instructions were limited to internal registers. This simplified many aspects of processor design: it allowed instructions to be fixed-length, simplified pipelines, and isolated the logic for dealing with the delay in completing a memory access (cache miss, etc.) to only two instructions. This led to RISC designs being referred to as load/store architectures.

In a classic RISC design, all instructions are the same length. This is an inefficient use of memory and therefore of bandwidth. If the more common instructions are given shorter encodings, much as in data compression, code takes less space. Variable-length encodings of this kind are characteristic of CISC (complex instruction set computer) processors. They reduce the memory traffic needed to fetch code, but at the expense of a much more complex decoder to expand the opcodes. The decoded instruction is then sent to the core to be executed, and in such designs the core itself is basically just a RISC processor.

CISC processors have more complex instructions of variable length. The compact nature of such instruction sets results in higher code density, smaller program sizes, and fewer (slow) main memory accesses, but they also require more pipeline stages to decode the instructions.

Many designs include pipelines as long as 7, 10 and even 20 stages. The classic RISC pipeline comprises:

1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back

The accompanying illustration shows a generic pipeline with four stages: fetch, decode, execute and write-back.
The top gray box is the list of instructions waiting to be executed, the bottom gray box is the list of instructions that have had their execution completed, and the middle white box is the pipeline.

Pipeline bubble

A pipelined processor may deal with hazards by stalling and creating a bubble in the pipeline, resulting in one or more cycles in which nothing useful happens.

In the illustration for this section, the four-stage pipeline executes four instructions, colored green, purple, blue and red in program order. In cycle 3, the processor cannot decode the purple instruction, perhaps because the processor determines that decoding depends on results produced by the execution of the green instruction. The green instruction can proceed to the Execute stage and then to the Write-back stage as scheduled, but the purple instruction is stalled for one cycle at the Fetch stage. The blue instruction, which was due to be fetched during cycle 3, is stalled for one cycle, as is the red instruction after it.

Because of the bubble (the blue ovals in the illustration), the processor's Decode circuitry is idle during cycle 3. Its Execute circuitry is idle during cycle 4 and its Write-back circuitry is idle during cycle 5.

When the bubble moves out of the pipeline (at cycle 6), normal execution resumes, but everything is now one cycle late. It will take 8 cycles (cycle 1 through 8) rather than 7 to completely execute the four instructions shown in colors.
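The cycle counts in the bubble example can be reproduced with a small scheduling sketch. It assumes a strictly in-order pipeline in which each stage holds one instruction at a time and an instruction cannot enter a stage until its predecessor has left it. The function names and the `stalls` dictionary, which maps `(instruction index, stage index)` to a number of extra cycles, are inventions for this example.

```python
def schedule(n_instructions, n_stages, stalls=None):
    """Return enter[i][s], the cycle in which instruction i enters stage s.

    Cycles are numbered from 1. `stalls[(i, s)]` delays the entry of
    instruction i into stage s by that many extra cycles.
    """
    stalls = stalls or {}
    enter = []
    for i in range(n_instructions):
        row = []
        for s in range(n_stages):
            # An instruction spends at least one cycle in its previous stage.
            ready = row[s - 1] + 1 if s > 0 else 1
            if i > 0:
                prev = enter[i - 1]
                # Stage s is free once the previous instruction has moved
                # on to stage s + 1 (or has finished, for the last stage).
                free = prev[s + 1] if s + 1 < n_stages else prev[s] + 1
                ready = max(ready, free)
            row.append(ready + stalls.get((i, s), 0))
        enter.append(row)
    return enter


def total_cycles(n_instructions, n_stages, stalls=None):
    """Cycle in which the last instruction occupies the last stage."""
    return schedule(n_instructions, n_stages, stalls)[-1][-1]
```

With four instructions and four stages, `total_cycles(4, 4)` gives 7. Delaying the decode of the second instruction by one cycle, `total_cycles(4, 4, {(1, 1): 1})`, gives 8, and the schedule shows the third instruction being fetched in cycle 4 instead of cycle 3, matching the description above.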